fix: Avoid unnecessary type casts in concat_ws #20436
neilconway wants to merge 3 commits into apache:main
I did a quick look at the changes and nothing obvious jumped out at me. I'll try and find time to do a more extensive review if no one else beats me to it.
@Omega359 Thank you!
🤖: Benchmark completed
builder.append_offset();
continue;
match return_datatype {
    DataType::Utf8View => {
I wonder if all this duplicated code could be eliminated with an approach similar to ...?
Yeah, I think that would make sense to do. I'm inclined to do it as a follow-up PR -- let me know if you'd prefer it as part of this PR.
Ok(dt.to_owned())
if arg_types.contains(&Utf8View) {
    Ok(Utf8View)
} else if arg_types.contains(&LargeUtf8) {
I had a thought about this. I think LargeUtf8 should take precedence over Utf8View because you cannot necessarily fit data from a LargeUtf8 column into Utf8View (i64 vs i32) https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html#variant.LargeUtf8
Interesting point. I believe the typical precedence today is Utf8View > LargeUtf8 > Utf8, partly on the grounds that "StringArray to StringViewArray is cheap but not vice versa". I can see arguments for both sides; if we want to reconsider this, seems like a distinct issue?
datafusion/datafusion/expr-common/src/type_coercion/binary.rs
Lines 1663 to 1667 in e894a03
It should probably be a separate issue, as it likely occurs in a few places. I am fairly certain I am correct on the proper type ordering here, but I doubt it would be encountered much in the wild: just how many columns would have > 2 billion bytes?
Also, this is really only pertinent in areas that pick a return type based on multiple columns. For the typical case where the udf is operating on a single column the existing logic should be fine - such as in btrim
"I think LargeUtf8 should take precedence over Utf8View because you cannot necessarily fit data from a LargeUtf8 column into Utf8View (i64 vs i32)"
I think the only type of data that can't be stored in a Utf8View that a LargeUtf8 can handle is individual strings that are longer than 2GB.
Otherwise, data from a LargeUtf8 will work just fine in Utf8View (the view will have multiple buffers rather than one large one).
Indeed, that was the point I was trying to get across. It's rare ... but possible. Though honestly I expect DF would fail somewhere else pretty quickly if a column with data that big was ever encountered.
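To make the size argument in this thread concrete, here is a minimal sketch of the two limits being compared. It assumes, following the comments above, that Utf8 is bounded by the total bytes addressable with 32-bit signed offsets and that Utf8View is bounded only per individual string (the "longer than 2GB" case). The helper names `fits_utf8` and `fits_utf8_view` are hypothetical and are not part of arrow or DataFusion.

```rust
// Assumption: both limits are taken as i32::MAX bytes, per the "i64 vs i32"
// and "longer than 2GB" remarks in the discussion.
const LIMIT: u64 = i32::MAX as u64;

/// Hypothetical check: can a column whose strings have these byte lengths be
/// stored with 32-bit offsets? The limit applies to the total byte count.
pub fn fits_utf8(lengths: &[u64]) -> bool {
    lengths.iter().sum::<u64>() <= LIMIT
}

/// Hypothetical check: can the same column be stored as views? Only each
/// individual string is limited, because the data may span several buffers.
pub fn fits_utf8_view(lengths: &[u64]) -> bool {
    lengths.iter().all(|&len| len <= LIMIT)
}
```

Under these assumptions, a column of three 1 GiB strings fails `fits_utf8` but passes `fits_utf8_view`, while a single 3 GiB string fails `fits_utf8_view`. That single-string case is the only one where LargeUtf8 can hold data that Utf8View cannot.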
@alamb This is ready to be reviewed and/or merged, I think.
Which issue does this PR close?
Rationale for this change
- `concat_ws` returned `Utf8`, regardless of the input types it was called with. If it was called with `LargeUtf8`, returning `Utf8` might overflow. In general, functions like these should operate on all three string representations unless there is a compelling reason not to (e.g., this is how `concat` works).
- `simplify_concat_ws` always constructed new literals with type `Utf8`. This led to unnecessary casts when its inputs were of a different string type.
What changes are included in this PR?
- Make the `concat_ws` return type match its input types, following how `concat` does it.
- In `simplify_concat_ws`, construct literals with the right type, not always `Utf8`.
- Rewrite `return_type` for `concat` to be more readable.
- Make the `StringViewArrayBuilder` API more similar to the other string array builders, with respect to null handling.
Are these changes tested?
Yes.
Are there any user-facing changes?
Yes: some queries involving `concat_ws` will now omit unnecessary cast operations, and the return type of `concat_ws` might be any of the three string types. Generally these changes should match user expectations better than the previous behavior.
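For readers skimming the description, the return-type rule can be sketched roughly as follows. This is an illustration only, not the PR's code: `StringType` is a hypothetical stand-in for the three string variants of arrow's `DataType`, and `pick_return_type` is a hypothetical name. The branch order mirrors the diff fragment quoted in the review and the Utf8View > LargeUtf8 > Utf8 precedence mentioned there.

```rust
/// Hypothetical stand-in for the three Arrow string types; the real code
/// works with `arrow::datatypes::DataType`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StringType {
    Utf8,
    LargeUtf8,
    Utf8View,
}

/// Pick a return type from the argument types, using the precedence
/// Utf8View > LargeUtf8 > Utf8 discussed in the review.
pub fn pick_return_type(arg_types: &[StringType]) -> StringType {
    if arg_types.contains(&StringType::Utf8View) {
        StringType::Utf8View
    } else if arg_types.contains(&StringType::LargeUtf8) {
        StringType::LargeUtf8
    } else {
        // No wider string type among the inputs, so plain Utf8 is enough.
        StringType::Utf8
    }
}
```

With this rule, mixing `Utf8` and `LargeUtf8` arguments yields `LargeUtf8`, and any `Utf8View` argument yields `Utf8View`, so a result matching the widest input representation needs no extra cast afterwards.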